Wrapper Maintenance

نویسندگان

  • Kristina Lerman
  • Craig A. Knoblock
چکیده

A Web wrapper is a software application that extracts information from a semi-structured source and converts it to a structured format. While semi-structured sources, such as Web pages, contain no explicitly specified schema, they do have an implicit grammar that can be used to identify relevant information in the document. A wrapper learning system analyzes page layout to generate either grammar-based or “landmark”-based extraction rules that wrappers use to extract data. As a consequence, even slight changes in the page layout can break the wrapper and prevent it from extracting data correctly. Wrapper maintenance is a composite task that (1) verifies that the wrapper continues to extract data correctly from a source, and (2) repairs the wrapper so that it works on the changed pages.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Automatic Wrapper Generation and Maintenance

This paper investigates automatic wrapper generation and maintenance for Forums, Blogs and News web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning method to generate the wrapper from this kind of web pages. The tree alignment algorithm is adopted...

متن کامل

Wrapper Maintenance: A Machine Learning Approach

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wr...

متن کامل

Learning the Common Structure of Data

The proliferation of online information sources has accentuated the need for tools that automatically validate and recognize data. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe two Web wrapper maintenance applications that employ this algorithm. The first application detects when a wrapper is not extracting correct data...

متن کامل

Regression testing for wrapper maintenance

Recent work on Internet information integration ~sumes a library of wrappers, specialized information extraction procedures. Maintaining wrappers is difficult, because the formatting regularities on which they rely often change. The wrapper verification problem is to determine whether a wrapper is correct. Standard regression testing approaches are inappropriate, because both the formatting reg...

متن کامل

Wrapper Induction and Maintenance in Documentum ECI

Documentum Enterprise Content Integration (ECI) services is a content integration middleware that provides one-query access to the Intranet and Internet content resources. The ECI Adapter technology offers an interface to any application for data and metadata extraction from unstructured Web pages. It offers a unique framework of wrapper production, automatic recovery and maintenance, developed...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009